Fake News Classification - EDA

This notebook is dedicated to exploratory data analysis (EDA) of the Fake News dataset. To avoid data leakage, EDA is performed only on the train set, while the test set is set aside for this purpose. Please note that the dataset does not define an explicit test set, so we randomly select our own and use it throughout the rest of the work.

1. Imports

2. Data loading and preprocessing

Please note that all news items without a title or text are discarded.
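As a minimal sketch of this filtering step, the snippet below drops rows that lack a title or a text using pandas. The toy frame and the column names `title`, `text` and `label` are assumptions made for illustration and may differ from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the dataset; the column names "title", "text" and
# "label" are assumptions, not taken from the notebook.
news = pd.DataFrame(
    {
        "title": ["Headline A", None, "Headline C", "Headline D"],
        "text": ["Body A", "Body B", None, "Body D"],
        "label": [0, 1, 0, 1],
    }
)

# Keep only rows where both the title and the text are present.
complete_news = news.dropna(subset=["title", "text"]).reset_index(drop=True)

print(len(news), "->", len(complete_news))
```

Restricting `dropna` to the two columns via `subset` keeps rows that are only missing other, unrelated fields.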

Exploratory data analysis is performed only on the train set, which in this case consists of 67% of all available news, chosen with a random state of 256.
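One way to sketch such a split is shown below, assuming the data lives in a pandas DataFrame. The toy frame and its column names are hypothetical, and the choice of `DataFrame.sample` is ours for illustration. scikit-learn's `train_test_split` with `test_size=0.33` and `random_state=256` is a common alternative, although the same seed would yield a different partition there.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (hypothetical columns).
news = pd.DataFrame(
    {
        "title": [f"title {i}" for i in range(100)],
        "label": [i % 2 for i in range(100)],
    }
)

# Sample 67% of the rows for training; the seed 256 comes from the text.
train = news.sample(frac=0.67, random_state=256)

# Every row that was not sampled forms the held-out test set.
test = news.drop(train.index)

print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so the same test rows stay untouched in later notebooks.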

3. Data visualization

The amount of available data is far too large to read through manually. However, thanks to WordCloud, it is possible to compress the information into a human-friendly plot. In this notebook, word clouds are created for two obvious cases, all news titles and all news texts, and additionally for the titles and texts of fake vs. real news, with the intention of checking whether fake news has any easy-to-spot red flags.
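A word cloud is essentially a picture of word frequencies, so the idea can be sketched in two steps: count the words, then render the counts. The helper names `word_frequencies` and `render_word_cloud`, the tokenization rule, the sample titles and the image settings below are all assumptions for illustration. Rendering relies on the third-party `wordcloud` package, which is imported lazily so that the counting step works without it.

```python
import re
from collections import Counter


def word_frequencies(documents, min_length=3):
    """Count lowercase word occurrences across an iterable of strings."""
    counts = Counter()
    for doc in documents:
        for word in re.findall(r"[a-z']+", doc.lower()):
            # Skip very short tokens such as "in" or "of".
            if len(word) >= min_length:
                counts[word] += 1
    return counts


def render_word_cloud(frequencies, path):
    """Render a frequency mapping to an image file with the wordcloud package."""
    from wordcloud import WordCloud  # lazy import: optional dependency

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Made-up titles used purely as sample input.
titles = ["Senate passes budget bill", "Budget vote delayed in Senate"]
frequencies = word_frequencies(titles)
print(frequencies.most_common(2))
```

Calling `render_word_cloud(frequencies, "titles.png")` would then write the plot to disk. Passing only the fake or only the real subset to `word_frequencies` gives the per-class clouds.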

All news' titles

Comments and spot-on observations:

All news' texts

Comments and spot-on observations:

Titles of fake vs. real news

Comments and spot-on observations:

In fact, there is a slight tendency to use the words "video" and "Breitbart", although it is not strong enough to serve on its own as an indicator of fake news.
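An impression taken from a word cloud can be cross-checked numerically by computing, for each class, the share of titles that contain a given word. The sketch below assumes a pandas DataFrame with hypothetical `title` and `label` columns, where `1` marks fake news. The titles are made up, so the printed shares say nothing about the real dataset.

```python
import pandas as pd

# Made-up titles; "label" == 1 is assumed to mean fake news.
titles = pd.DataFrame(
    {
        "title": [
            "Watch this video now",
            "Budget approved",
            "Shocking video leaked",
            "Council meets",
            "Video of council session",
            "Election results",
        ],
        "label": [1, 1, 1, 0, 0, 0],
    }
)


def share_containing(df, term):
    """Return, per label, the fraction of titles containing `term`."""
    contains = df["title"].str.contains(term, case=False, regex=False)
    return contains.groupby(df["label"]).mean()


video_share = share_containing(titles, "video")
print(video_share)
```

A large gap between the two shares would suggest a useful signal, while a small gap supports treating the word as weak evidence at best.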

Texts of fake vs. real news

Comments and spot-on observations:

Now let's check whether there are other words occurring more often in either category of news.
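One simple way to run such a check, sketched here under our own assumptions, is to compare the relative frequency of every word in the two categories and rank words by the size of the gap. The function names, the tokenization rule and the sample texts are illustrative only.

```python
import re
from collections import Counter


def relative_frequencies(texts):
    """Map each word to its share of all word occurrences in `texts`."""
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z']+", text.lower())
    )
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_gaps(fake_texts, real_texts, top_n=5):
    """Rank words by the absolute gap in relative frequency (fake minus real)."""
    fake = relative_frequencies(fake_texts)
    real = relative_frequencies(real_texts)
    gaps = {
        word: fake.get(word, 0.0) - real.get(word, 0.0)
        for word in set(fake) | set(real)
    }
    return sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)[:top_n]


# Made-up sample texts, used only to exercise the functions.
fake_texts = ["shocking shocking claim"]
real_texts = ["official report claim"]
gaps = frequency_gaps(fake_texts, real_texts)
print(gaps[0])
```

A positive gap means the word is relatively more common in the fake sample, and a negative gap means it is more common in the real one. Using relative rather than raw counts keeps the comparison fair when the two categories differ in size.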

Comments and spot-on observations: